Project Description¶
In this project, we used our data analysis skills to understand where the ingredients in an avocado toast come from. We used data from the Open Food Facts database, which has detailed information about many food items, including their countries of origin.
We focused on three main ingredients:
- Avocados
- Olive oil
- Sourdough bread (represented by lemons in the data as an example)
Each ingredient came with:
- A CSV file (like
avocado.csv
) that has information about the product and where it comes from. - A TXT file (like
relevant_avocado_categories.txt
) that lists the important food categories we need to focus on.
What We Did¶
- We filtered the data to keep only the rows that matched the relevant food categories (like fruits, oils, etc.) using the
categories_tags
column. - Each product often had more than one category, so we had to check all the tags in each row.
- We then looked at the
origins_tags
column to find out the country of origin of each item.
Goal¶
By the end of this project, we created a list showing which countries the ingredients in avocado toast most commonly come from. This helps us better understand the global supply chain behind just one simple dish.
Our final task was to find the most common country of origin for:
- Avocados
- Olive oil
- Sourdough bread
In [6]:
# import library and read tab-delimited data
import pandas as pd
avocado = pd.read_csv('avocado.csv', sep='\t')
avocado.head()
Out[6]:
code | lc | product_name_de | product_name_el | product_name_en | product_name_es | product_name_fi | product_name_fr | product_name_id | product_name_it | ... | off:ecoscore_data.adjustments.packaging.non_recyclable_and_non_biodegradable_materials | off:ecoscore_data.adjustments.production_system.value | off:ecoscore_data.adjustments.threatened_species.value | sources_fields:org-database-usda:available_date | sources_fields:org-database-usda:fdc_category | sources_fields:org-database-usda:fdc_data_source | sources_fields:org-database-usda:fdc_id | sources_fields:org-database-usda:modified_date | sources_fields:org-database-usda:publication_date | data_sources | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0059749979702 | fr | NaN | NaN | NaN | NaN | NaN | Naturalia Avocado Oil | NaN | NaN | ... | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - yuka, Apps |
1 | 7610095131409 | en | NaN | NaN | NaN | NaN | NaN | Avocado Bowl chips | NaN | NaN | ... | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - Yuka, Apps, Producers, Producer - zweifel |
2 | 4005514005578 | en | NaN | NaN | Gelbe Linse Avocado Brotaufstrich | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 15.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - yuka, Apps, App - smoothie-openfoodfacts |
3 | 0879890002513 | en | NaN | NaN | Avocado toast chili lime | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - Yuka, Apps, App - InFood |
4 | 0223086613685 | en | NaN | NaN | Avocado | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - Yuka, Apps |
5 rows × 184 columns
In [5]:
# Subset large DataFrame to include only relevant columns
subset_columns = [ 'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
avocado = avocado[subset_columns]
avocado.head()
Out[5]:
code | lc | product_name_en | quantity | serving_size | packaging_tags | brands | brands_tags | categories_tags | labels_tags | countries | countries_tags | origins | origins_tags | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0059749979702 | fr | NaN | NaN | NaN | NaN | Naturalia | naturalia | en:plant-based-foods-and-beverages,en:plant-ba... | NaN | Canada | en:canada | NaN | NaN |
1 | 7610095131409 | en | NaN | NaN | NaN | NaN | Zweifel | zweifel | en:snacks,en:salty-snacks,en:appetizers,en:chi... | en:vegetarian,en:vegan | Switzerland, World | en:switzerland,en:world | NaN | NaN |
2 | 4005514005578 | en | Gelbe Linse Avocado Brotaufstrich | NaN | NaN | NaN | Tartex | tartex | de:abendbrotsufstrich | en:organic,en:eu-organic,en:eg-oko-verordnung | Germany | en:germany | NaN | NaN |
3 | 0879890002513 | en | Avocado toast chili lime | NaN | NaN | NaN | NaN | NaN | NaN | NaN | United States | en:united-states | NaN | NaN |
4 | 0223086613685 | en | Avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | United States | en:united-states | NaN | NaN |
In [25]:
# Gather relevant categories data for avocados
with open("relevant_avocado_categories.txt", "r") as file:
relevant_avocado_categories = file.read().splitlines()
file.close()
In [26]:
relevant_avocado_categories
Out[26]:
['en:avocadoes', 'en:avocados', 'en:fresh-foods', 'en:fresh-vegetables', 'en:fruchte', 'en:fruits', 'en:raw-green-avocados', 'en:tropical-fruits', 'en:tropische-fruchte', 'en:vegetables-based-foods', 'fr:hass-avocados']
In [16]:
# Turn a column of comma-separated tags into a column of lists
avocado['categories_list'] = avocado['categories_tags'].str.split(',')
# Drop rows with null values in the newly created column
avocado = avocado.dropna(subset = 'categories_list')
# Filter a DataFrame based on a column of lists
avocado = avocado[avocado['categories_list'].apply(lambda x: any(i in relevant_avocado_categories for i in x))]
avocado.head()
Out[16]:
code | lc | product_name_de | product_name_el | product_name_en | product_name_es | product_name_fi | product_name_fr | product_name_id | product_name_it | ... | off:ecoscore_data.adjustments.production_system.value | off:ecoscore_data.adjustments.threatened_species.value | sources_fields:org-database-usda:available_date | sources_fields:org-database-usda:fdc_category | sources_fields:org-database-usda:fdc_data_source | sources_fields:org-database-usda:fdc_id | sources_fields:org-database-usda:modified_date | sources_fields:org-database-usda:publication_date | data_sources | categories_list | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 3662994002063 | fr | NaN | NaN | NaN | NaN | NaN | Avocat hass | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - smoothie-openfoodfacts, Apps, App - Yuka | [en:plant-based-foods-and-beverages, en:plant-... |
6 | 8437013031011 | fr | NaN | NaN | NaN | NaN | NaN | Avocat | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - Yuka, Apps | [en:plant-based-foods-and-beverages, en:plant-... |
14 | 4016249238155 | de | Saatenliebe Mohn Avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 15.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - yuka, Apps | [en:plant-based-foods-and-beverages, en:plant-... |
17 | 8718963381532 | de | EAT ME AVOCADO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - Yuka, Apps, App - smoothie-openfoodfacts | [en:plant-based-foods-and-beverages, en:plant-... |
23 | 8436002746707 | es | NaN | NaN | NaN | AGUACATES CASADO PREMIUM | NaN | NaN | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | App - elcoco, Apps, App - yuka, App - El CoCo | [en:plant-based-foods-and-beverages, en:plant-... |
5 rows × 185 columns
Where do most avocados come from?¶
In [18]:
# Filter DataFrame for UK data
avocados_uk = avocado[(avocado['countries']=='United Kingdom')]
# Find most common country for avocado origin
avocado_origin = (avocados_uk['origins_tags'].value_counts().index[0])
avocado_origin = avocado_origin.lstrip("en:")
avocado_origin
Out[18]:
'peru'
Lets create a general function to read and filter data for a particular ingredient, and return the top origin country for that food item¶
In [3]:
def read_and_filter_data(filename, relevant_categories):
df = pd.read_csv(filename, sep='\t', low_memory=False)
# Subset large DataFrame to include only relevant columns
subset_columns = [ 'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
df = df[subset_columns]
# Split tags into lists
df['categories_list'] = df['categories_tags'].str.split(',')
# Drop rows with null categories data
df = df.dropna(subset = 'categories_list')
# Filter data for relevant categories
df = df[df['categories_list'].apply(lambda x: any([i for i in x if i in relevant_categories]))]
# Filter data for the UK
df_uk = df[(df['countries']=='United Kingdom')]
# Find top origin country string with the highest count
top_origin_string = (df_uk['origins_tags'].value_counts().index[0])
# Clean up top origin country string
top_origin_country = top_origin_string.lstrip("en:")
top_origin_country = top_origin_country.replace('-', ' ')
print(f'**{filename[:-4]} origins**','\n', top_origin_country, '\n')
print ("Top origin country: ", top_origin_country)
print ("\n")
# End of function - return top origin country for this ingredient
return top_origin_country
In [22]:
# Analyze avocado origins again, this time by calling function
top_avocado_origin = read_and_filter_data('avocado.csv',relevant_avocado_categories)
**avocado origins** peru Top origin country: peru
Repeat process above using function for the other 2 ingredients¶
In [5]:
# Read the file with UTF-8 encoding
with open("relevant_olive_oil_categories.txt", "r", encoding="utf-8") as file:
relevant_olive_oil_categories = file.read().splitlines()
file.close()
In [8]:
# Calling our function on olive_oil.csv
top_olive_oil_origin = read_and_filter_data('olive_oil.csv',relevant_olive_oil_categories)
**olive_oil origins** greece Top origin country: greece
In [32]:
# Gather relevant categories data for sourdough
with open("relevant_sourdough_categories.txt", "r") as file:
relevant_sourdough_categories = file.read().splitlines()
file.close()
In [34]:
# Call our function on sourdough.csv
top_sourdough_origin = read_and_filter_data('sourdough.csv',relevant_sourdough_categories)
**sourdough origins** united kingdom Top origin country: united kingdom